An  Analysis  of  Hierarchical  Genetic  Programming 

Justinian P. Rosca

Technical  Report  566 
March  1995 



UNIVERSITY  OF 

ROCHESTER 

COMPUTER SCIENCE

DISTRIBUTION STATEMENT A

Approved for public release; distribution unlimited


An  Analysis  of  Hierarchical  Genetic  Programming 


Justinian P. Rosca
rosca@cs.rochester.edu 

The  University  of  Rochester 
Computer  Science  Department 
Rochester,  New  York  14627 

Technical  Report  566 

March  1995 


Abstract 

Hierarchical genetic programming (HGP) approaches rely on the discovery, modification,
and use of new functions to accelerate evolution. This paper provides a qualitative
explanation of the improved behavior of HGP, based on an analysis of the evolution process
from the dual perspective of diversity and causality. From a static point of view, the use of
an HGP approach enables the manipulation of a population of higher diversity programs. Higher
diversity increases the exploratory ability of the genetic search process, as demonstrated by
theoretical and experimental fitness distributions and expanded structural complexity of
individuals. From a dynamic point of view, this report analyzes the causality of the crossover
operator. Causality relates changes in the structure of an object with the effect of such
changes, i.e. changes in the properties or behavior of the object. The analysis of crossover
causality suggests that HGP discovers and exploits useful structures in a bottom-up,
hierarchical manner. Diversity and causality are complementary, affecting exploration and
exploitation in genetic search. Unlike other machine learning techniques that need extra
machinery to control the tradeoff between them, HGP automatically trades off exploration
and exploitation.


This material is based on work supported by the National Science Foundation under grant number
IRI-9406481 and by DARPA research grant no. MDA972-92-J-1012. The government has certain rights in
this material.

1  Introduction


The problem of understanding and controlling the mechanism of genetic programming (GP)
is challenging, especially in the case of GP extensions for the discovery and evolution of
functions. Such GP extensions have been designed with the goal of automating the discovery
of functions that are beneficial during the search for solutions by exploiting opportunities
to parameterize and reuse code. Two such techniques are automatic definition of functions
(ADF) [Koza, 1992] and adaptive representation (AR) [Rosca and Ballard, 1994a]. The
former is a GP extension that allows the evolution of reusable subroutines. The latter is
based on the discovery of useful building blocks of code. Although the ADF and AR approaches
implement the ideas of discovery, modification, and use of new functions in different ways,
both evolve a hierarchy of functions that greatly improves search efficiency. This
paper refers to both mechanisms as hierarchical genetic programming (HGP).

No clear mathematical analysis currently exists for how either GP or HGP samples the
solution space. The goal of this paper is to analyze the influence of different
representational choices on the behavior of GP. This paper analyzes several explanations for the
improved behavior of HGP due to function discovery and proposes a bottom-up HGP
evolution scenario: HGP discovers and exploits useful structures in a bottom-up, hierarchical
manner.

Two complementary dimensions of genetic search are discussed in the paper: diversity
of programs and GP causality. Discovery and use of encapsulated subroutines cause
increased population diversity. Experimental evidence outlining increased program size and
varied program shape is presented to explain this increased population diversity. The paper
compares theoretical and practical distributions of fitness for randomly generated solutions
in a test problem characterized by a finite function sample space.

Causality relates changes in the structure of an object with the effect of such changes,
that is, changes in the properties or behavior of the object. The principle of strong
causality states that small alterations in the underlying structure of an object, or small
departures from the cause, determine small changes of the object’s behavior, or small changes
of the effects, respectively ([Rechenberg, 1994], [Lohmann, 1992]). In GP, small alterations
of the programs may generate big changes in behavior. From this perspective GP is weakly
causal. In this report, the trend of structures called birth certificates is presented as
evidence for the way HGP inherits useful structures. Birth certificates represent types of
crossover in the genealogic tree of a solution and record the evolution trajectory of that
solution.

The report outline is as follows. The next section defines the underlying principle of
HGP and the resulting change in representation, then briefly presents the two HGP
approaches used throughout the experiments and other related work. Section 3 introduces a
test case and presents a theoretical analysis of fitness distributions for a uniform probability
distribution of solutions. It analyzes the random generation of program trees and compares
theoretical distributions of partial solutions with those actually obtained in GP for a varying
function set. Changes in representation determine changes in the size, shape, and behavior
of program trees. Section 4 presents an analysis of the GP evolution dynamics. The
discovery of functions offers a means for expressing, combining, and propagating useful building
blocks. Thus, it contributes in an essential way to the exploratory ability of GP. Discovered
functions represent an adaptive control mechanism in the exploration-exploitation tradeoff.
In conclusion, the paper discusses the results and suggests future research.


2  Hierarchical  Genetic  Programming 

Genetic  programming  departs  from  the  genetic  algorithm  (GA)  paradigm  by  using  trees  to 
represent  genotypes  ([Cramer,  1985],  [Koza,  1992]).  Trees  provide  a  flexible  representation 
for  creating  and  manipulating  programs.  This  paper  uses  the  denotations  tree  and  subtree 
to  refer  to  the  parse  tree  of  a  program  or  a  part  of  it  respectively. 

Problem representation in GP is defined by a set of problem-dependent primitive
functions. Functions of one or more variables label internal nodes of the tree while functions
of no arguments, called terminals, label leaves of the tree. The search space for GP is the
space of all programs that can be built using these initial primitives. The intuition for
hierarchical GP systems is that adapting the composition of these sets dramatically changes
the behavior of GP. For example, the inclusion of more complex functions, known to be
part of a final solution, will result in less computational effort spent during search and thus
will enable a shorter time to finding a final solution.

The HGP approaches presented below, automatic definition of functions (ADF-GP)
[Koza, 1992] and adaptive representation (AR-GP) [Rosca and Ballard, 1994a], use the
above observation in different ways in order to accelerate search.

2.1  Automatic  Definition  of  Functions 

The automatic definition of functions approach (ADF-GP) assumes that parsimonious
problem solutions can be specified in terms of a main program and a hierarchical collection of
subroutines.  The  main  program  invokes  a  subset  of  the  subroutines  to  perform  the  overall 
computation,  while  those  subroutines  may  in  turn  call  other  subroutines  computing  partial 
results. 

Genetic  programming  is  used  both  to  search  for  appropriate  subroutines,  and  to  find  a 
way  of  composing  discovered  subroutines  and  primitive  functions  into  a  complete  solution. 
In  this  approach,  apparently,  GP  has  to  perform  a  more  difficult  search  task.  The  problem 
becomes  well  defined  if  the  functions  and  terminals  that  can  be  invoked  by  each  subroutine 
and  by  the  main  program  are  completely  specified.  During  evolution,  only  the  fitness  of  the 
complete  program  is  evaluated. 

In this approach each individual program has a dual structure. The structure is defined
based on a fixed number of components or branches to be evolved: several function branches
and a main program branch. Each function branch (called ADF0, ADF1, etc.) has a fixed
number of arguments. The main program branch (Program-Body) produces the result.
Each branch can be viewed as a piece of Lisp code built out of specific primitive terminal and
function sets, and is subject to genetic operations. The set of function-defining branches, the
number of arguments that each of the functions possesses, and the “alphabet” (function and
terminal sets) of each branch define the architecture of a program. The references allowed
between function branches determine a hierarchical organization of the set of functions.
Although the number and interconnectivity of ADFs are fixed, the definitions of ADFs
evolve. Genetic operations on ADFs are syntactically constrained by the components on
which they can operate. For example, crossover can only be performed between subtrees
of the same type, where subtree type depends on the function and terminal symbols used
in the definition of that subtree. An example of a simple typing rule for an architecturally
uniform population of programs is branch typing. Each branch of a program is designated
as having a distinct type. In this case the crossover operator can only swap subtrees from
analogous branches.

Figure 1: A hypothetical call graph of the extended function set in the AR method. The primitive
function set is extended hierarchically with functions (DF1, DF2, etc.) discovered at generation
numbers a, b, c. A solution is eventually found at generation d.
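Branch typing can be illustrated with a minimal Python sketch. The representation used here (a dictionary of named branches, each holding a nested list of the form [function, child, child] or a terminal string) and the helper names `subtree_paths`, `get_subtree`, `set_subtree` and `branch_typed_crossover` are assumptions made for illustration only; they are not taken from [Koza, 1992]. The sketch shows only the typing constraint: both crossover points are drawn from the same branch of the two parents.

```python
import copy
import random

def subtree_paths(tree, path=()):
    """Return the path (tuple of child indices) of every node in a tree."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(subtree_paths(child, path + (i,)))
    return paths

def get_subtree(tree, path):
    """Follow a path of child indices down from the root."""
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    """Return the tree with the node at `path` replaced by `new`."""
    if not path:                       # replacing the root of the branch
        return new
    parent = get_subtree(tree, path[:-1])
    parent[path[-1]] = new
    return tree

def branch_typed_crossover(parent_a, parent_b, rng):
    """Swap random subtrees between analogous branches of two parents."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    branch = rng.choice(sorted(child_a))   # the same branch name in both parents
    path_a = rng.choice(subtree_paths(child_a[branch]))
    path_b = rng.choice(subtree_paths(child_b[branch]))
    sub_a = copy.deepcopy(get_subtree(child_a[branch], path_a))
    sub_b = copy.deepcopy(get_subtree(child_b[branch], path_b))
    child_a[branch] = set_subtree(child_a[branch], path_a, sub_b)
    child_b[branch] = set_subtree(child_b[branch], path_b, sub_a)
    return child_a, child_b
```

Because material is exchanged only within one branch, symbols that are legal in a function branch (for instance its formal arguments) can never migrate into the main program branch, and vice versa.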

2.2  Adaptive  Representation 

In  contrast  to  ADF’s  passive  function  definition,  AR  explicitly  attempts  to  discover  and  use 
new  functions.  A  hierarchy  of  automatic  functions  is  created  in  a  bottom-up  fashion  as  the 
problem  is  being  solved  (see  figure  1). 

At  the  base  of  the  function  hierarchy  lie  the  primitive  functions  from  the  initial  function 
set.  More  complex  functions  are  dynamically  built  on  the  primitive  functions,  and  become 
stable  components  of  the  representation.  The  levels  in  the  hierarchy  are  discovered  by 
using  either  heuristic  information  as  conveyed  by  the  environment  or  statistical  information 
extracted  from  the  population.  The  heuristics  are  embedded  in  block  fitness  functions  which 
are  used  to  determine  fit  blocks  of  code.  The  hierarchy  of  functions  evolves  as  a  result  of 
several  steps: 

1.  Select  candidate  building  blocks  from  fit  small  blocks  appearing  in  the  population 

2.  Generalize candidate blocks to create new functions

3.  Extend the representation with the new functions, noticing if progress is made.

In order to control the process of function discovery, AR-GP keeps track of small blocks
of code appearing in the population. A key idea is that although one might like to keep
track of blocks of arbitrary size, only monitoring the merit of small blocks is feasible. Useful
blocks tend to be small and the process can be applied recursively to discover more and
more complex useful blocks. Consequently, AR-GP has a bottom-up approach to function
discovery [Rosca and Ballard, 1994b].
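The three discovery steps and the restriction to small blocks can be sketched as follows. This is a rough illustration under strong assumptions, not the AR-GP algorithm itself: trees are nested tuples, a "small block" is taken to be any subtree of height at most two, generalization is assumed to replace the block's terminals by formal parameters, and raw frequency among fit programs stands in for the block fitness functions, which are not specified here. All function names are hypothetical. The extinction step and the monitoring of progress are omitted.

```python
from collections import Counter

def height(tree):
    """Height of a tree given as a terminal string or (name, *children)."""
    return 0 if isinstance(tree, str) else 1 + max(height(c) for c in tree[1:])

def small_blocks(tree, max_height=2):
    """Yield every subtree whose height is between 1 and max_height."""
    if isinstance(tree, str):
        return
    if height(tree) <= max_height:
        yield tree
    for child in tree[1:]:
        yield from small_blocks(child, max_height)

def generalize(block):
    """Replace the terminals of a block by parameters: returns (params, body)."""
    mapping = {}

    def rename(node):
        if isinstance(node, str):
            return mapping.setdefault(node, f"ARG{len(mapping)}")
        return (node[0],) + tuple(rename(c) for c in node[1:])

    body = rename(block)
    return tuple(mapping.values()), body

def extend_function_set(function_set, fit_programs, how_many=1):
    """Select frequent small blocks, generalize them, and add them as functions."""
    counts = Counter(b for p in fit_programs for b in small_blocks(p))
    extended = dict(function_set)
    for block, _ in counts.most_common(how_many):
        name = f"DF{sum(k.startswith('DF') for k in extended) + 1}"
        extended[name] = generalize(block)
    return extended
```

Applying `extend_function_set` again to programs that already call DF1 lets later blocks contain calls to earlier ones, which is one way to picture the bottom-up growth of the hierarchy.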

The generation intervals with no function set changes represent evolutionary epochs. At
the beginning of each new epoch, part of the population is extinguished and replaced with
random individuals built using the extended function set [Rosca and Ballard, 1994a]. The
extinction step was introduced in order to make use of the newly discovered functions.

The  discovery  of  functions  in  AR  can  be  guided  by  domain  knowledge.  Most  generally, 
the  population  itself  represents  a  pool  of  statistical  information.  Global  measures  such  as 
the  population  diversity  or  local  measures  such  as  the  differential  fitness  from  parents  to 
offspring  can  be  used  to  guide  the  creation  of  new  functions. 

2.3  Other  related  work 

Modularization  is  an  approach  which  addresses  the  problems  of  inefficiency  and  scaling  in 
GP.  This  issue  has  generated  research  efforts  towards  defining  the  notion  of  building  block 
in  GP  and  finding  useful  ways  to  manipulate  modules  of  code. 

A  GP  analogy  along  the  lines  of  GA  schemata  theory  and  GA  building  block  hypothesis 
has  been  attempted  in  [O’Reilly  and  Oppacher,  1994].  The  main  goal  was  understanding 
if  GP  problems  have  building  block  structure  and  when  GP  is  superior  to  other  search 
techniques.  The  approach  was  to  generalize  the  definition  of  a  GP  schema  from  [Koza, 
1992]  to  a  collection  of  tree  fragments,  that  is  a  collection  of  trees  possibly  having  subtrees 
removed. An individual instantiates a schema in case it “covers” (matches) all the schema
fragments, with no overlaps between fragments allowed. The probability of disruption
by  crossover  is  estimated  based  on  these  definitions.  The  authors  concluded  that  schema 
analysis  is  difficult  and  does  not  offer  an  appropriate  perspective  for  analyzing  GP. 

A GP structural theory analogous to GA schemata theory fundamentally ignores the
functional role of the GP representation. The analysis of building blocks in AR-GP [Rosca
and Ballard, 1994a] starts from this hypothesis and takes a functional approach. The
ADF approach, presented earlier, is also a method of representing and using modularity in
GP. Another method, module acquisition ([Angeline, 1994b], [Angeline and Pollack, 1994]),
introduced many inspirational ideas. A module is a function with a unique name defined by
selecting and chopping off branches of a subtree selected randomly from an individual. The
approach uses the compression operator to select blocks of code for creating new modules,
which are introduced into a genetic library and may be invoked by other programs in
the population. Two effects are achieved. First, the expressiveness of the base language
is increased. Second, modules become frozen portions of genetic material, which are not
subject to genetic operations unless they are subsequently decompressed.

It has been conjectured that problems whose solutions present symmetry patterns or
opportunities to parameterize and reuse code can be solved more easily in ADF-GP [Koza, 1994b],
but there exists no formal explanation of why ADF-GP works better than standard GP.
[Kinnear, 1994] explains why ADF-GP works by introducing the notion of structural
regularity. He compares ADF-GP against the module acquisition approach and points out that
the module acquisition approach does not directly create structural regularity. Kinnear
attributes the better performance of ADF to the repeated use of calls to automatically defined
functions and to the multiple use of parameters.

The  lens  effect  [Koza,  1994b]  is  the  idea  that  the  tails  of  the  fitness  distribution  for 
randomly  generated  programs  are  larger  for  ADF-GP  than  for  standard  GP.  The  effect  is 
attributed  to  the  introduction  of  new  functions  into  the  representation.  [Altenberg,  1994] 
outlines  that  a  similar  property  should  be  observed  in  general  in  order  to  make  GP  search 
more  efficient  than  random  search:  the  upper  tail  of  the  offspring  fitness  distribution  should 
be  wider  than  that  for  random  search. 

The problem of determining the appropriate architectural choices in ADF-GP has
generated work on evolution of the GP architecture. The architecture itself can be evolutionarily
selected in case the initial population is architecturally diverse and care is taken when
crossing over individuals having different architectures [Koza, 1994b]. [Koza, 1994a] introduces
six new genetic operations for altering the architecture of an individual program: branch
duplication, argument duplication, branch deletion, argument deletion, branch creation and
argument creation. These operations are causal in the sense discussed later in this paper.

A rule of thumb in GA literature postulates that population diversity is important
for avoiding premature convergence. A comparison of research on this topic is provided
in [Ryan, 1994]. Ryan shows that maintaining increased diversity in GP leads to better
performance. His algorithm is called “disassortative mating” because it selects parents for
crossover from two different lists of individuals. One list of individuals is ranked based on
fitness while the other is ranked based on the sum of size and weighted fitness. The goal
is to evolve solutions of minimal size that solve the problem. However, using the size
constraint directly prevents the GP algorithm from finding solutions. Ryan’s algorithm
improves convergence to a better optimum while maintaining speed.

Exploration and exploitation are recurring themes in search and learning problems
[Holland, 1992], [Kaelbling, 1993]. Exploitation takes place when search proceeds based on the
action prescribed by the current system knowledge. Exploration is usually based on
random actions, taken in order to experiment with more situations. For example, in learning
classifier systems, roulette wheel action selection is a means of choosing exploratory actions.
In the reinforcement phase of the control loop of a classifier system [Wilson, 1994],
matching classifiers that do not get activated are weakened. This lowers the chances of choosing
unpromising actions in the near future. The weakening magnitude is usually controlled
by an explicit parameter, although more elaborate schemes are possible [Wilson, 1994]. In
contrast, GP is a search technique that implicitly balances exploration and exploitation, as
will be shown later.

3  The  Role  of  the  GP  Representation


One goal of the paper is to analyze the influence of different representational choices on
the behavior of GP both theoretically and experimentally using a standard GP algorithm
and HGP. The test case chosen is the parity problem. Parity is an attractive problem for
several reasons. First, it operates on a finite sample space, the space of Boolean functions
with a given number of inputs. This enables the computation of distributions of interest for
random choices of an initial population. Second, parity is difficult to learn because every
time an input bit is flipped, the output also changes.

The ODD-n-PARITY problem is to find a logical composition of primitive Boolean
functions that computes the sum of input bits over the field of integers modulo 2. EVEN-n-PARITY
can be defined by flipping the result of ODD-n-PARITY. The ODD-n-PARITY and
EVEN-n-PARITY functions appear to be difficult to learn in GP, especially for values of n greater
than five [Koza, 1992].

The initial function set for the parity problem in GP is defined by the set of primitive
Boolean functions of two variables:

$$\mathcal{F}_0 = \{\mathrm{AND}, \mathrm{OR}, \mathrm{NAND}, \mathrm{NOR}\} \qquad (1)$$

The terminal set is defined by a set of Boolean variables:

$$\mathcal{T}_0 = \{D_0, D_1, D_2, \ldots, D_{n-1}\}$$

Any Boolean function of n variables is defined on the set of $2^n$ combinations of input
values. Given a program implementing a Boolean function, its performance is computed on
all possible combinations of Boolean values for the input variables and is compared with a
table defining the EVEN-n-PARITY function. Each time the program and the EVEN-n-PARITY
table give the same result, the program records a hit. The task is to discover a program
that achieves the maximum number ($2^n$) of hits.
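The hit count described above can be made concrete with a short Python sketch. The tree encoding (a terminal written as the string "D<i>", or a tuple (function, left, right)) and the names `PRIMITIVES`, `evaluate`, `even_parity` and `hits` are assumptions chosen for this illustration; only the four primitive functions and the definition of a hit come from the text.

```python
from itertools import product

# The four primitive Boolean functions of the initial function set (1).
PRIMITIVES = {
    "AND":  lambda a, b: a and b,
    "OR":   lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR":  lambda a, b: not (a or b),
}

def evaluate(tree, inputs):
    """Evaluate a tree: a terminal 'D<i>' or a tuple (function, left, right)."""
    if isinstance(tree, str):
        return inputs[int(tree[1:])]
    name, left, right = tree
    return PRIMITIVES[name](evaluate(left, inputs), evaluate(right, inputs))

def even_parity(inputs):
    """True when the number of 1 bits among the inputs is even."""
    return sum(inputs) % 2 == 0

def hits(tree, n):
    """Count the input combinations on which the tree agrees with EVEN-n-PARITY."""
    return sum(
        bool(evaluate(tree, bits)) == even_parity(bits)
        for bits in product((False, True), repeat=n)
    )
```

For example, the tree `("OR", ("AND", "D0", "D1"), ("NOR", "D0", "D1"))` computes EVEN-2-PARITY and scores all 4 hits for n = 2, whereas the single terminal `"D0"` scores 2.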


Theoretical  analysis  of  uniform  random  sampling 

The  efficiency  of  a  GP  algorithm  depends  on  the  computational  effort  needed  to  evolve  a 
solution  with  a  given  probability.  Random  search  provides  an  upper  bound  on  the  effort 
needed.  The  probability  of  randomly  generating  a  problem  solution  depends  both  on  the 
initial  function  set,  and  on  the  method  of  generating  random  individuals.  Our  goal  is  to 
understand  the  influence  of  the  function  set  composition,  and  consequently  of  a  function 
discovery  mechanism  on  this  probability. 

Let us consider the sample space of all functions

$$S = \{ f \mid f : B^n \rightarrow B \}$$

where $B = \{0, 1\}$. Note that $\|S\| = 2^{2^n}$, thus we can obtain random elements of S by
flipping $2^n$ distinct fair coins.

Consider the random variable X mapping the finite sample space S to the set of
natural numbers $\mathbb{N}$ defined as follows: X is the number of hits of a randomly
generated Boolean function $s \in S$.

We  are  interested  in  analyzing  the  probability  mass  function  of  X. 

$$\mathrm{Prob}\{X = x\} = \sum_{s \in S \,:\, X(s) = x} \mathrm{Prob}\{s\}$$

Consider a random¹ Boolean function with k hits. The k hits are due to i 1-hits and
to (k - i) 0-hits. EVEN-n-PARITY takes an equal number (i.e. $2^{n-1}$) of 0 and 1 values over
the set of input binary strings. Thus, the number of Boolean functions that coincide with
EVEN-n-PARITY on exactly k input strings is

$$\sum_{i=0}^{k} \binom{2^{n-1}}{i} \binom{2^{n-1}}{k-i} = \binom{2^n}{k} \qquad (2)$$


which implies that X has a binomial distribution, with $p = q = \frac{1}{2}$. It follows that

$$\mathrm{Prob}\{X = k\} = \binom{2^n}{k} \cdot \frac{1}{2^{2^n}} \qquad (3)$$


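The expected value of X is $2^n/2$ and its variance is $2^n/4$; they can be computed as follows (or
see [Cormen et al., 1990]):

$$E[X] = \sum_{k=0}^{2^n} k \cdot \mathrm{Prob}\{X = k\} = \frac{1}{2^{2^n}} \cdot \left( 2^n \, 2^{2^n - 1} \right) = \frac{2^n}{2} \qquad (4)$$

$$Var[X] = E[X^2] - (E[X])^2 = \frac{1}{2^{2^n}} \sum_{k=0}^{2^n} k^2 \binom{2^n}{k} - \frac{2^{2n}}{4} = \frac{2^n (2^n + 1)}{4} - \frac{2^{2n}}{4} = \frac{2^n}{4} \qquad (5)$$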
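Relations (3), (4) and (5) can be checked numerically. The following Python sketch computes the exact probability mass function of X with rational arithmetic and derives its mean and variance; the function names `hit_distribution` and `mean_and_variance` are introduced here for illustration only. For n = 5 the result is a mean of 16 hits and a variance of 8, that is, a standard deviation of about 2.83.

```python
from fractions import Fraction
from math import comb

def hit_distribution(n):
    """Exact pmf of X, the number of hits of a uniformly random n-input table."""
    size = 2 ** n            # number of input combinations
    total = 2 ** size        # number of Boolean functions of n variables
    return [Fraction(comb(size, k), total) for k in range(size + 1)]

def mean_and_variance(pmf):
    """Mean and variance of a distribution given as a list indexed by k."""
    mean = sum(k * p for k, p in enumerate(pmf))
    variance = sum(k * k * p for k, p in enumerate(pmf)) - mean ** 2
    return mean, variance
```

Using exact fractions rather than floating point keeps the comparison with the closed forms $2^n/2$ and $2^n/4$ free of rounding error.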


3.1  Program  diversity 

In  order  to  understand  the  role  of  representation  and  the  effect  of  dynamically  changing  it 
we  designed  a  set  of  experiments  for  estimating  qualitative  measures  of  diversity  such  as 
fitness  distributions  and  program  size  in  GP  and  HGP. 

A straightforward definition of diversity in GP is the percentage of structurally distinct
individuals at a given generation. Two individuals are structurally distinct if they are not
isomorphic trees. However, such a definition is not practically useful. It is computationally
expensive to test for tree isomorphisms. Moreover, associativity of functions is extremely
difficult to take into account.

¹ Here and in what follows, the term random refers to structures generated randomly according to a
uniform probability distribution.

² In order to prove equality (2), use Newton’s binomial on both sides of the identity
$(1+x)^{2^n} = (1+x)^{2^{n-1}} (1+x)^{2^{n-1}}$ and identify coefficients.

An easily observable type of variation in the population is fitness diversity. Two
individuals are different if they score differently.

Another useful qualitative measure of diversity is program size. In HGP, a true measure
of the size of an individual is obtained by counting all the nodes in the tree resulting
after an “inline” expansion of all the called functions down to the primitive functions.
This complexity measure is called “expanded structural complexity” in [Rosca and Ballard,
1994b] and is based on the structural complexity (i.e. the number of tree nodes) of all the
functions in the hierarchy which are called directly or indirectly by the individual. The
expanded structural complexity of a program F, denoted IC(F), can be computed in a
bottom-up manner starting with the lowest functions in the call graph of F. For each
subfunction G, called directly or indirectly by F, IC(G) can be defined using a recursive
formula (see the appendix).
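The inline-expansion view of program size can be sketched directly from the definition above. The Python code below is not the recursive formula of the appendix, which is not reproduced here; it simply counts the nodes that would result from inlining every call, under the assumptions that trees are nested tuples, that `definitions` maps each defined function to a pair (parameter names, body), and that the call graph contains no recursion. The names `structural_complexity` and `expanded_complexity` are chosen for this illustration.

```python
def structural_complexity(tree):
    """Number of nodes of a tree given as a terminal string or (name, *children)."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(structural_complexity(child) for child in tree[1:])

def expanded_complexity(tree, definitions, param_sizes=None):
    """Node count after inlining every call to a defined function.

    `param_sizes` gives the expanded size of the argument bound to each formal
    parameter while a function body is being measured.
    """
    param_sizes = param_sizes or {}
    if isinstance(tree, str):
        # A formal parameter counts as its whole argument, a terminal as 1 node.
        return param_sizes.get(tree, 1)
    name, *args = tree
    arg_sizes = [expanded_complexity(a, definitions, param_sizes) for a in args]
    if name not in definitions:
        # Primitive function: one node plus the expanded arguments.
        return 1 + sum(arg_sizes)
    params, body = definitions[name]
    # Defined function: measure its body with parameters bound to the arguments.
    return expanded_complexity(body, definitions, dict(zip(params, arg_sizes)))
```

As an example, let ADF0(a, b) be `("OR", ("AND", "a", "b"), ("NOR", "a", "b"))`. The program `("ADF0", "D0", ("ADF0", "D1", "D2"))` has only 5 nodes, but 19 nodes once both calls are inlined, because each parameter is used twice in the body of ADF0.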

Three experiments are reported next. First, a uniform random generation of parity
tables is compared to a GP random generation of program trees. Second, we vary the
composition of the primitive function set and analyze again the fitness distribution of randomly
generated GP programs. Third, we analyze the expanded structural complexity of GP and
HGP solutions. The method of generating GP individuals in the second experiment,
borrowed from [Koza, 1992], is the ramped-half-and-half method. In order to create an initial
population of increased diversity, this method generates trees of depth varying modulo the
initial maximum size (taken to be six) and of either balanced or random shape.
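A simplified Python sketch of ramped-half-and-half generation follows. It is an illustration under stated assumptions rather than a reproduction of the procedure in [Koza, 1992]: the minimum depth of 2, the depth convention (a single terminal has depth 0), the way depth limits and shapes are assigned to successive individuals, and the fact that a randomly shaped tree may consist of a single terminal are all choices made here. The names `random_tree`, `ramped_half_and_half` and `depth_of` are hypothetical.

```python
import random

FUNCTIONS = ("AND", "OR", "NAND", "NOR")   # all take two arguments

def random_tree(depth, terminals, full, rng):
    """Build a tree no deeper than `depth`; `full` forces a balanced shape."""
    p_terminal = len(terminals) / (len(terminals) + len(FUNCTIONS))
    if depth == 0 or (not full and rng.random() < p_terminal):
        return rng.choice(terminals)
    return (rng.choice(FUNCTIONS),
            random_tree(depth - 1, terminals, full, rng),
            random_tree(depth - 1, terminals, full, rng))

def ramped_half_and_half(pop_size, n_inputs, max_depth=6, min_depth=2, rng=None):
    """Generate trees whose depth limit cycles over min_depth..max_depth."""
    rng = rng or random.Random()
    terminals = [f"D{i}" for i in range(n_inputs)]
    ramp = max_depth - min_depth + 1
    population = []
    for i in range(pop_size):
        depth = min_depth + i % ramp        # depth limit varies modulo the ramp
        full = (i // ramp) % 2 == 0         # alternate balanced and random shape
        population.append(random_tree(depth, terminals, full, rng))
    return population

def depth_of(tree):
    """Depth of a tree; a single terminal has depth 0."""
    return 0 if isinstance(tree, str) else 1 + max(depth_of(c) for c in tree[1:])
```

Mixing depth limits and shapes in this way is what gives the initial population a range of sizes instead of a single typical tree size.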

3.2  Diversity  experiments 

If elements $s \in S$ are generated uniformly then the probability of generating EVEN-n-PARITY
is $1/2^{2^n}$. For the EVEN-3-PARITY problem, [Koza, 1994b] reports that no solution is discovered
after the random generation of 10 million parity functions. However, the above analysis
implies that for n = 3 it should be considerably easier (one in 256 trees) to find a solution
if the random generation of trees in GP results in a uniform distribution of functions.
Unfortunately, even for a uniform distribution of functions, the probability to generate a
solution decreases super-exponentially in the problem size. About four billion GP functions
would have to be generated in order to find one that computes EVEN-5-PARITY (n = 5). We
will see that GP with functions does much better than this.
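The arithmetic behind these figures is easy to reproduce. The short Python sketch below (with the hypothetical helper names `number_of_boolean_functions` and `log10_probability_no_solution`) counts the Boolean functions of n variables and, assuming independent uniform draws, computes the base-10 logarithm of the probability that a given number of draws all miss the parity function. The logarithm is used because the probability itself is too small to represent as a floating-point number.

```python
import math

def number_of_boolean_functions(n):
    """Size of the sample space S: one output bit for each of the 2^n inputs."""
    return 2 ** (2 ** n)

def log10_probability_no_solution(n, draws):
    """log10 of the chance that `draws` uniform draws all miss EVEN-n-PARITY."""
    p = 1 / number_of_boolean_functions(n)
    return draws * math.log10(1 - p)
```

For n = 3 there are 256 functions and for n = 5 there are 4,294,967,296. Under uniform sampling, 10 million draws without a single EVEN-3-PARITY solution would be astronomically unlikely, which indicates that random GP trees are far from uniformly distributed over Boolean functions.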

Figure 2 compares the distribution of hits obtained for a population of tables (ideal case),
GP functions and ADF-GP functions. The mean and standard deviation of the distribution
of randomly generated tables compare closely to the theoretical results outlined above (see
relations (4) and (5)) for n = 5, although only 16,000 random tables were generated. The
distribution of GP functions in the EVEN-5-PARITY problem, with the function set defined
in (1), shows that for h < 12 or h > 20 the probability of having h hits is practically zero.
The GP random distribution is much narrower than might be anticipated.
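The ideal case of uniformly generated tables can be simulated in a few lines. The Python sketch below draws random truth tables as bit strings and counts their hits against EVEN-n-PARITY; it is an illustrative reconstruction of that kind of experiment, not the code used for Figure 2, and the names `parity_table` and `sample_table_hits` are assumptions.

```python
import random

def parity_table(n):
    """EVEN-n-PARITY as a bit mask: bit i is set when pattern i has an even number of ones."""
    return sum(1 << i for i in range(2 ** n) if bin(i).count("1") % 2 == 0)

def sample_table_hits(n, samples, rng):
    """Hits of `samples` uniformly random truth tables against EVEN-n-PARITY."""
    size = 2 ** n
    target = parity_table(n)
    full_mask = (1 << size) - 1
    hits = []
    for _ in range(samples):
        table = rng.getrandbits(size)              # flip 2^n fair coins
        agree = ~(table ^ target) & full_mask      # bits where table and target match
        hits.append(bin(agree).count("1"))
    return hits
```

With n = 5 and 16,000 samples, the sample mean and standard deviation should fall close to the theoretical values of 16 and about 2.83 given by relations (4) and (5).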

It is worth examining what happens when automatically defined functions are used.
Figure 2 shows that a random population of ADF-GP trees generated using the
ramped-half-and-half method has a wider distribution of hits than standard GP. The effect is called
the lens effect in [Koza, 1994b].

Figure 2: Probability mass function of the random variable X representing the number of hits with
the EVEN-5-PARITY function in three cases: (a) random generation of Boolean tables; (b) random
generation of standard GP EVEN-5-PARITY functions; (c) random generation of ADF-GP
EVEN-5-PARITY functions, with two automatically defined functions and two arguments each.
(Horizontal axis: number of hits.)


Varying  the  composition  of  the  primitive  function  set 

Figure 3 shows the hit distributions for GP when the composition of the function set is
varied. In the all-functions plots, all 16 Boolean functions of two variables are included
in $\mathcal{F}$ while in the some-functions plots a random selection of half of these 16 functions,
including the primitive ones in $\mathcal{F}_0$, is part of the initial function set $\mathcal{F}$. Figure 3 also shows the
same experiment performed with ADF-GP. Two automatically defined functions have been
used, each having two arguments. ADF0 has the function set $\mathcal{F}_0$. ADF1 can additionally
invoke ADF0. The program body can additionally invoke both ADF0 and ADF1.
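The two function-set variants can be illustrated by enumerating truth tables. In the Python sketch below each Boolean function of two variables is identified by its outputs on the inputs (0,0), (0,1), (1,0), (1,1); the names `all_binary_boolean_functions`, `PRIMITIVE_TABLES` and `some_functions` are hypothetical, and the random choice shown is only one plausible reading of "a random selection of half of these 16 functions".

```python
import random
from itertools import product

def all_binary_boolean_functions():
    """Map each of the 16 truth tables over (a, b) to a callable."""
    inputs = list(product((0, 1), repeat=2))     # (0,0), (0,1), (1,0), (1,1)
    functions = {}
    for outputs in product((0, 1), repeat=4):
        table = dict(zip(inputs, outputs))
        functions[outputs] = lambda a, b, table=table: table[(a, b)]
    return functions

# Truth tables of the primitive functions in (1), in the input order above.
PRIMITIVE_TABLES = {
    "AND":  (0, 0, 0, 1),
    "OR":   (0, 1, 1, 1),
    "NAND": (1, 1, 1, 0),
    "NOR":  (1, 0, 0, 0),
}

def some_functions(rng, count=8):
    """Random selection of `count` functions that always keeps the primitives."""
    primitives = set(PRIMITIVE_TABLES.values())
    others = [t for t in all_binary_boolean_functions() if t not in primitives]
    return sorted(primitives | set(rng.sample(others, count - len(primitives))))
```

The all-functions variant corresponds to using every key returned by `all_binary_boolean_functions()`, while `some_functions` returns eight tables, four of which are always the primitives.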


Expanded  structural  complexity 

Table 1 presents a sample of complexity results obtained with the standard GP algorithm
and with ADF-GP. The rows having 0 in the “Generation” column correspond to an initial
random generation of programs. The other two rows are the results at the end of successful
runs. The ADF-GP rows include the structural complexity values obtained for the two
evolved sub-functions (ADF0 and ADF1) and the main program body (Body). The table
shows that expanded complexity in ADF-GP is more than an order of magnitude higher than the
structural complexity of programs in standard GP.

Figure 3: Probability mass function of the number of hits when all or some (a random selection of)
Boolean functions of two variables are used in generating EVEN-5-PARITY programs. The primitive
set includes the primitive functions AND, OR, NAND, NOR.


Table 1: Complexity results for EVEN-5-PARITY tested with the standard GP and ADF-GP
algorithms. Average values are determined from 12 runs.

                          Structural Complexity      Expanded Complexity
Method     Generation     ADF0     ADF1     Body     Best      Average
Std.GP     0              -        -        -        15        6.53
Std.GP     28             -        -        -        180       241.2
ADF-GP     0              15       15       45       423       439.9
ADF-GP     30             41       13       95       5497      6429.3


3.3  Comparison  of  results 

The  narrow  GP  hit  distribution  suggests  a  low  population  diversity.  A  solution  by  means 
of  GP  will  be  difficult  to  obtain,  because  it  would  require  more  generations,  and  thus  an 
increased  computational  effort,  to  create  diverse  individuals.  Moreover,  search  may  be 
successful  provided  that  fitness-proportionate  selection  and  the  genetic  operators  used  do 
not  narrow  the  population  diversity  even  more.  This  change  in  the  hit  distribution  for  HGP 
is a direct result of the introduction of higher level functions into the representation. It is
one  of  the  hypotheses  explaining  why  HGP  approaches  work  better  than  standard  GP. 

When the function set is varied, an even wider distribution results (see the GP-some-functions
and GP-all-functions distributions in figure 3). When defined functions are
used, the hit distribution does not become much wider. However, the ADF-GP method
still generates larger standard deviations and thus increased diversity. Randomly generated
programs with the highest number of hits (28) were obtained using this method. Overall,
the distributions of hits look very similar when a larger selection of functions is used.

Note that F_0 is complete in the sense that any Boolean function can be written
using only functions from F_0. The effect of apparently non-useful functions, initially included in
the function set, is beneficial. All new functions, either ADFs or initial extensions of the
function set, are based on the initial primitive functions and terminals. Theoretically, the
search space remains the same: the space of all programs that can be built from F_0
and T_0. What changes in ADF-GP is the sampling of the search space by means of the
crossover operator. Still, any solution that can be obtained by ADF-GP could theoretically be found
by GP, although the time to find a solution would be significantly longer. From the static
point of view of creating an initial population, using ADFs is equivalent to considering a
larger initial function set.

A more formal interpretation of these remarks can be stated by considering the closure
requirement in GP [Koza, 1992]. Closure requires that any function be well defined for
any combination of arguments (terminals or results of other function calls) that it may
encounter. Suppose that any subtree returns a value from a domain, call it D, and that
the result returned by a program depends on a subset of variables from T which defines
the input space. Define F_total to be the set of functions mapping the input space onto D.
Then an ADF is a function from F_total. ADF-GP may simply be interpreted as GP over an
enlarged function set F_total. Over generations, the use of ADFs is equivalent to a dynamic
sampling of various functions from this much larger function set.

Naturally, F ⊂ F_total. It may be difficult to determine the appropriate functions from
F_total necessary to solve a given problem. It is unrealistic to consider huge function sets
in either GP or ADF-GP. However, GP can be used to select primitives that can be better
combined to yield candidate solution improvements [Koza, 1994b]. In this case, automatic
selection of primitives will have its computational cost.

The  increased  fitness  diversity  is  determined  by  a  larger  set  of  functions  that  is  given  to 
express  candidate  solutions.  Equivalently,  the  expanded  set  of  functions  biases  GP  search 
towards  regions  of  the  search  space  containing  better  individuals. 

The  increased  standard  deviation  of  program  hits  can  be  correlated  with  the  increased 
size  and  more  diverse  structure  of  individuals  obtained  by  using  ADF-GP  or,  similarly, 
AR-GP.  This  results  in  an  increased  GP  exploration  of  the  space  of  programs.  The  use  of 
an  HGP  approach  enables  the  manipulation  of  a  population  of  higher  diversity  programs, 
which  positively  affects  the  efficiency  of  an  HGP  algorithm  for  complex  problems. 

4 HGP Evolution Dynamics


The dynamics of GP evolution have been very difficult to analyze. The traditional analysis of GAs
by Holland [Holland, 1992] focuses on the propagation of schemata from one generation
to the next. The building block hypothesis ([Holland, 1992], [Goldberg, 1989]) outlines
the importance of small schemata, called building blocks, in the proper functioning of a
GA. More recently, crossover has been considered the differentiating feature that gives a
GA advantages over other stochastic methods in certain types of problems. For example,
[Eshelman and Schaffer, 1993] provide evidence that crossover with pair-wise mating helps
propagate middle order building blocks.

The arguments presented so far have analyzed a static picture of GP. The wider hit
distributions and the increased expanded structural complexity suggested an increased
exploration potential of GP with functions. The focus of attention in this section moves to an
analysis of the HGP evolution dynamics through the crossover operator. A simple analysis
of the effects of the crossover operator suggests that GP structures are highly unstable.
However, a more careful analysis of the way HGP discovers useful structures reveals that
selection gradually favors changes with small effects on the individual behavior. A hierarchy
of functions is essential in order to extend the potential for state space exploitation.

4.1  GP  Causality 

The main problem in identifying how HGP works is determining the effects of the crossover
operation as reflected in the variation of fitness from parents to offspring and in the
population composition at a given time.

The intuition is that most crossover operations have a harmful effect. In particular,
offspring of individuals that are already partially adapted to the “environment” and already
have a complex structure are more likely to have a worse fitness. This is close to the
conclusions on the role of mutation in natural evolution [Wills, 1993]. It is also in agreement
with our intuition that a small change in a program may drastically change the program
behavior. In addition there is the following simple argument. Consider a partial solution to
a hypothesis formation problem obtained using standard GP and represented by a tree T.
Consider that T is selected as a parent and it is possible to obtain a solution by modifying
T in such a way that a certain subtree T_i is not changed. Consider also that crossover
points are chosen with uniform probability over the set of m nodes of T. The probability
of choosing a crossover point v that does not lie within T_i is:

$$\mathrm{Prob}(\mathrm{Select}(v) \mid v \notin T_i) = 1 - \frac{\mathrm{Size}(T_i)}{\mathrm{Size}(T)}$$

The bigger T_i is (and this is true in the case of a hypothetical convergence to a solution)
the smaller is the probability of keeping it unchanged. The dynamics of trees shows the
phenomenon of instability or poor causality of GP structures. Next we discuss four important
issues that show a more complete picture of the problem.
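
The probability argument above can be checked numerically. The following is a minimal sketch, assuming that nodes are indexed 0 to m-1 and that the protected subtree occupies the first indices; the function names are hypothetical and serve only as an illustration.

```python
import random


def prob_outside_subtree(size_subtree, size_tree):
    """Probability that a uniformly chosen node of the tree lies outside the subtree."""
    if size_tree <= 0 or not 0 <= size_subtree <= size_tree:
        raise ValueError("need 0 <= size_subtree <= size_tree and size_tree > 0")
    return 1.0 - size_subtree / size_tree


def simulate(size_subtree, size_tree, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability.

    Assumption: nodes 0 .. size_subtree - 1 belong to the protected subtree.
    """
    rng = random.Random(seed)
    outside = sum(rng.randrange(size_tree) >= size_subtree for _ in range(trials))
    return outside / trials


exact = prob_outside_subtree(25, 100)
estimate = simulate(25, 100)
print(exact, estimate)
```

As the protected subtree grows towards the size of the whole tree, `prob_outside_subtree` tends to 0, which is the instability discussed above.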

First, how much code of a parse tree representing an individual is effective? It is well
known that GP evolves non-parsimonious trees if no size pressure is included in the fitness
evaluation, a phenomenon suggestively called “defense against crossover” [Altenberg, 1994].
However, the useless regions of code may represent reservoirs of genetic material [Angeline,
1994a]. They either preserve or evolve good fragments of code to be activated later during
evolution as a result of crossover. One such example is presented in figure 4. Moreover,
redundant regions of code may be created artificially, in analogy with the natural phenomenon
of gene duplication, in order to evolve a better program by specializing its treatment of a
subclass of inputs [Koza, 1994a].


Figure 4: The structure tree for a GP solution to EVEN-5-PARITY. The notion of structure tree was
introduced in [Rosca and Ballard, 1994a] with the goal of qualitatively analyzing program transformations
during evolution. A structure tree has its nodes labeled with the most recent generation number when
the node played a pivot role in a crossover operation. Zero labeled nodes remained unchanged from the
initial generation. The highlighted subtrees did not play any role in the evaluation of their parents but are
important in the final solution.

Second, from a topological point of view, where are most of the crossover changes
performed? Equivalently we can ask about the expected height of crossover pivot points. A
fundamental remark is that, in current GP practice, crossover nodes are chosen according
to a uniform probability distribution. If we additionally assume, for a rough approximation,
that trees operated upon are complete binary trees, we can compute the expected height of
crossover pivot points:


$$E[\mathrm{CrossoverHeight}] = \sum_{i} \mathrm{Prob}(i) \cdot \mathrm{Height}(i) = 2 - \frac{h}{2^{h} - 1}$$

where Prob(i) is the probability of choosing node i of height Height(i) as a crossover point,
and h is the tree height. The leaf nodes of the complete binary tree all have height 1. This
result, although based on an assumption, shows that most of the changes are closer to the
tree bottom. The effect can be noticed on the tree in figure 4, which is a typical case.
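
The closed form can be verified by direct summation: a complete binary tree of height h has 2^(h-k) nodes at height k (leaves at height 1) and 2^h - 1 nodes in total. The sketch below is an illustration only, with hypothetical function names, and uses exact rational arithmetic to compare the two expressions.

```python
from fractions import Fraction


def expected_crossover_height(h):
    """Exact expected height of a uniformly chosen node.

    Assumes a complete binary tree of height h whose leaves have height 1.
    """
    total_nodes = 2**h - 1
    # There are 2**(h - k) nodes at height k.
    weighted_sum = sum(k * 2 ** (h - k) for k in range(1, h + 1))
    return Fraction(weighted_sum, total_nodes)


def closed_form(h):
    """The closed-form value 2 - h / (2**h - 1)."""
    return 2 - Fraction(h, 2**h - 1)


for h in (1, 2, 5, 10):
    print(h, float(expected_crossover_height(h)), float(closed_form(h)))
```

The expected height stays below 2 for every tree height, so under these assumptions crossover points concentrate near the leaves.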

Third, what is the influence of selecting smaller or bigger subtrees to participate in
the crossover operation? We hypothesized that the crossover operator generates non-causal
changes in GP. A complete answer to this question would involve an analysis of the
properties of the function set. The effect of a small change can be severe in problems of symbolic
regression, or less severe in problems of regression of Boolean functions. For example the
result of the Boolean function represented in figure 5 will be given by the result of evaluating
δ on a given fitness case only if the evaluation of the following Boolean expression is true:

$$\alpha \cdot \neg\beta \cdot \neg\gamma$$

The longer the path to δ the higher will be the probability that δ plays a reduced role in
the overall evaluation.


Figure 5: A small change in a Boolean tree will not necessarily determine a sharp change in the program
behavior.
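
The masking effect can be illustrated with a small experiment. The tree used below, AND(α, OR(β, OR(γ, δ))), is a hypothetical example chosen so that δ determines the output exactly when α is true and both β and γ are false; it is not necessarily the tree drawn in figure 5.

```python
from itertools import product


def program(alpha, beta, gamma, delta):
    """Hypothetical Boolean tree: AND(alpha, OR(beta, OR(gamma, delta)))."""
    return alpha and (beta or (gamma or delta))


def delta_matters(alpha, beta, gamma):
    """True if changing the value of delta changes the program output."""
    return program(alpha, beta, gamma, False) != program(alpha, beta, gamma, True)


# Assignments of (alpha, beta, gamma) for which the subtree delta is visible.
sensitive = [abc for abc in product([False, True], repeat=3) if delta_matters(*abc)]
fraction_visible = len(sensitive) / 8
print(sensitive, fraction_visible)
```

In this example only one of the eight assignments exposes δ. Each additional masking operator on the path to δ halves that fraction again, assuming independent, uniformly distributed inputs.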

Fourth, how does GP exploit structures? In contrast to GP crossover, the GA crossover
operator is homologous, that is, it maintains fixed positions for exchanged alleles. GP
crossover is non-homologous in the sense that it does not preserve the position of the
subtree on which it operates, being allowed to paste a subtree at any tree level. The
probability of choosing homologous crossover points in two structurally similar parents in order
to transmit the parent functionality to offspring is inversely proportional to the product of
the parent sizes, i.e. it is very low. Moreover, if trees grow in size, this probability decreases
even more and becomes negligible. This implies that even when the two parents are
identical, offspring will most often have a totally different functionality, and most probably they
will score lower than their parents. Selection favors crossover changes that recombine parts of the
structure of the parents so as to improve performance, but how? In several problem domains
one can observe the superposition of the parent behaviors in the offspring. In an example
for the problem of finding an impulse response function, Koza showed that crossover
produces an improved offspring performance by improving one parent’s performance for one
portion of the time domain, and inheriting the behavior of the other parent for the rest of
the domain [Koza, 1994b]. Such a behavior has been interpreted as “case splitting”: GP
refines a partial solution by changing a subtree so that the program treats separately, in a
more detailed way, a particular input case. In this case, structures are exploited through
the function they have when computing fitness.

In  spite  of  the  apparent  non-causality,  GP  evolves  better  and  better  solutions.  Two 
explanations  could  be  given  to  this  apparent  paradox.  A  first  explanation,  which  holds 
mostly  in  early  stages  of  evolution,  is  the  exploration  of  the  space  of  programs,  as  discussed 
in  the  previous  section.  Once  the  population  average  fitness  increases,  it  becomes  less 
probable to find above average individuals by pure exploration. Presumably, GP exploits
the  structures  preserved  in  the  population.  For  this,  most  changes  selected  for  in  later 
stages  of  evolution  are  small  changes,  most  probably  changes  at  higher  tree  depths  and  in 
subtrees  of  small  heights.  Such  changes  are  causal  changes  because  they  slightly  alter  the 
function  of  offspring  in  comparison  to  parents. 

4.2  HGP  causality 

The above properties and problems are inherited by ADF-GP. Moreover, ADF-GP presents
an even higher instability for crossover due to two main reasons: amplification of effects
in the subroutine hierarchy and lexical scoping which characterizes subroutine definitions.
Note that the ADF approach attacks the search problem at different structural levels
simultaneously. GP has to discover both the definitions for a fixed set of sub-functions, each
with a predefined number of parameters, and how to combine calls to the automatically
defined functions within the main body. This corresponds roughly to discovering a way to
decompose the problem and solving the subproblems given only the maximum number of
subproblems and the general structure of the subproblems (i.e. the number of parameters).
Due to the imposed ordering of ADFs we can consider each ADF as a different structural
level.

The amplification of effects in the subroutine hierarchy can be illustrated with a simple
example. If crossover determines a change in a low ADF in the hierarchy, for instance
ADF0, the change will cause a different behavior for all subroutines which invoke ADF0
and in all subtrees which invoke the affected subroutines. Thus, a change at the basis of
the hierarchy in an individual will drastically change the individual behavior. A change at
a higher level in the hierarchy could be amplified too, provided that the subroutine changed
is effectively used by other subroutines or the main program more than once.

The lexical scoping problem is illustrated in figure 6. During GP search, modifications
are alternatively made at each of the structural levels. A code fragment brought from
another individual changes its function entirely if it contains calls to ADFs. For example,
consider a piece of code involving calls to lower order ADFs that is pasted in a higher-order
function or the main body as a result of a crossover operation, and suppose the definitions
of the ADFs in the two parents are entirely different. Lexical scope dictates the definition
to be used when computing a call to a sub-function, so that the calls to ADFs from the
piece of code transplanted will refer to the ADF definitions in the new scope. The crossover
operation will most probably change the function of the receiving parent completely.
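
The lexical scoping effect can be reproduced with a toy interpreter. The sketch below is an illustration under assumptions: programs are encoded as nested tuples, and the ADF definitions of the donor and of the receiver (AND and NOR respectively) are hypothetical choices, not taken from an actual run.

```python
from itertools import product

# Primitive Boolean functions available to every program.
PRIMITIVES = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR": lambda a, b: not (a or b),
}


def evaluate(expr, env, adfs):
    """Evaluate a nested-tuple expression.

    expr: a variable name, or a tuple (function_name, arg_expr, ...)
    env:  dict mapping variable names to Boolean values
    adfs: dict mapping ADF names to (parameter_names, body_expr)
    """
    if isinstance(expr, str):
        return env[expr]
    name, *args = expr
    values = [evaluate(arg, env, adfs) for arg in args]
    if name in PRIMITIVES:
        return PRIMITIVES[name](*values)
    params, body = adfs[name]  # resolved in the scope of the current individual
    return evaluate(body, dict(zip(params, values)), adfs)


# The same fragment is evaluated against two different sets of ADF definitions.
fragment = ("ADF0", "d0", "d1")
donor_adfs = {"ADF0": (("x", "y"), ("AND", "x", "y"))}
receiver_adfs = {"ADF0": (("x", "y"), ("NOR", "x", "y"))}

cases = [dict(zip(("d0", "d1"), bits)) for bits in product([False, True], repeat=2)]
in_donor = [evaluate(fragment, case, donor_adfs) for case in cases]
in_receiver = [evaluate(fragment, case, receiver_adfs) for case in cases]
changed = sum(a != b for a, b in zip(in_donor, in_receiver))
print(in_donor, in_receiver, changed)
```

The fragment is syntactically identical in both individuals, yet its behavior differs on half of the input cases because the call to `ADF0` is resolved against the definitions of the receiving individual.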

The  arguments  above  suggest  the  tendency  towards  even  more  non-causal  effects  of  the 
crossover  operation  in  ADF-GP.  The  non-causality  property  of  ADF  is  undesirable  in  later 
stages  of  evolution  as  it  prevents  GP  from  exploiting  good  structures  already  incorporated 
in  the  population. 

In general, GP and ADF-GP exhibit poor causality. It is useful to visualize how the
search for a solution may generally proceed in ADF-GP. Each of the ADF functions
represents a different subroutine. Consider the last modification imposed on a program tree
before it becomes an acceptable solution. It is very unlikely, but not impossible, that this
last change has been a change with a large influence, for example a change in one of the
functions at the basis of the hierarchy. This situation represents a lucky change. Most
probably, though, it was a change at the highest level, in a subtree of small height of the
program body.

Figure 6: The non-causality of ADF: A fragment of code copied from individual A into individual B finds
itself in a new lexical environment, where local definitions for ADFs apply. A represents the donor parent
before crossover is applied while B represents the receiving parent after the crossover operation.

We  hypothesize  as  a  general  principle  of  GP  dynamics  that  selection  most  often  favors 
small  changes.  Only  such  changes,  respecting  the  principle  of  strong  causality,  have  the 
highest  chance  of  being  successful.  The  effect  of  this  principle  is  a  stabilization  on  lower 
level  ADFs  that  will  be  useful.  The  evolutionary  process  freezes  good  subroutines  at  a  given 
hierarchy  level  and  looks  for  changes  at  higher  levels  [Simon,  1973].  Hence: 

Hypothesis.  Hierarchical  genetic  programming  (and  in  particular  ADF-GP)  discovers 
and  exploits  useful  structures  in  a  bottom-up  manner. 

Note  that  this  hypothesis  is  the  basic  idea  in  the  AR-GP  extension.  In  AR-GP,  the 
selection  of  potentially  useful  subroutines  generalizing  blocks  of  code  drives  the  adaptation 
of  the  problem  representation. 

4.3  Birth  certificates 

In order to test the above hypothesis we have studied the most recent part of the genealogy
tree for EVEN-n-PARITY problems. This was done by assigning to each individual a
birth certificate that specifies its parents and the method of birth (one of ADF0 crossover,
ADF1 crossover, main program body crossover or reproduction). We hoped that an analysis
of the birth certificates, starting with the final solution and tracing its origin backwards,
would shed light on the GP dynamics, as hypothesized above. In order to determine the
effect of the different types of birth operations we compute a temporally discounted frequency
factor for a given solution tree T for each type of birth (birth-type):

,  _  \  j  ^  X{birth-type}  (0 ' (6) 

^  '1'  ^  »=0 

where k_T is the number of programs in the genealogy tree of T down to a depth d, and
χ_{birth-type}(T_i) is the characteristic function of ancestor T_i of T, returning 1 if T_i has a birth
certificate of type birth-type and 0 otherwise. The scaling factor is a normalizing
constant that makes the discounted frequencies of all birth types for a fixed tree T add up to 1.
We used a discounted formula to reflect the higher importance of crossover operations from
more recent generations.
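
A computation of this kind can be sketched as follows. The sketch makes explicit assumptions that may differ in detail from equation (6): each ancestor contributes a weight γ^age, where age is the number of generations separating it from the solution, and the weights are normalized by their total so that the frequencies of all birth types sum to 1. The function name and the example ancestry are hypothetical.

```python
from collections import defaultdict


def discounted_frequencies(certificates, gamma=0.8, max_age=8):
    """Temporally discounted frequency of each birth type.

    certificates: iterable of (birth_type, age) pairs, one per ancestor, where
                  age is the number of generations before the final solution.
    Assumption: each certificate weighs gamma**age and the weights are
    normalized by their sum; certificates older than max_age are ignored.
    """
    weights = defaultdict(float)
    for birth_type, age in certificates:
        if age <= max_age:
            weights[birth_type] += gamma**age
    total = sum(weights.values())
    if total == 0:
        return {}
    return {birth_type: w / total for birth_type, w in weights.items()}


# Hypothetical ancestry of a solution, for illustration only.
ancestry = [("Body", 0), ("Body", 1), ("ADF1", 1), ("ADF0", 2), ("ADF0", 9)]
freqs = discounted_frequencies(ancestry)
print(freqs)
```

The default values γ = 0.8 and a window of 8 generations follow the settings reported for Table 2.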

Table 2 presents the results for several successful runs of ADF-GP for EVEN-5-PARITY,
with two ADFs and three arguments each. In most cases, the frequency factors are highest
for the program-body or clearly decrease from program-body to ADF1 to ADF0. These
results support the earlier conclusion that ADF-GP search relies in most cases on changes
at higher and higher structural levels which make it possible to exploit good code fragments
that already appeared in the population.


Table 2: Statistics of birth certificates in successful runs of EVEN-5-PARITY using ADF-GP with a
zero mutation rate. Each certificate of a given type counts one unit and is temporally discounted
with a discount factor γ = 0.8 based on its age. Only certificates at most 8 generations old have
been considered.

          # Individuals    Birth Certificate Frequency     Final
GP Run    Explored         ADF0      ADF1      Body        Generation
1         123,009          0.295     0.0       0.704       32
2         110,692          0.221     0.472     0.416       32
3         62,699           0.077     0.526     0.397       17
4         35,162           0.447     0.102     0.451       9
5         55,748           0.1       0.214     0.685       15
6         55,438           0.093     0.202     0.704       15


The  above  numerical  results  have  taken  into  account  only  a  small  time  window  compared 
to  the  entire  number  of  generations.  A  complete  picture  of  the  importance  of  various  types  of 
crossover  during  the  entire  GP  evolution  can  be  constructed  from  a  more  detailed  analysis  of 
birth  certificates.  Such  an  analysis  is  depicted  in  figure  7.  It  suggests  the  overall  importance 
of  a  birth  certificate  type  from  generation  0  till  the  solution  is  found.  While  the  percentage 
of  program-body  changes  increases,  the  percentage  of  ADF  changes  decreases. 

A  similar  analysis  is  depicted  in  figure  8.  Part  (a)  represents  a  stacked  chart  which 
suggests  both  the  overall  importance  of  a  birth  certificate  type  as  well  as  its  trend  over  the 
entire  evolution  period,  from  generation  0  till  the  solution  is  found.  Part  (b)  displays  the 
same  results  in  an  overlapping  fashion  to  allow  for  better  comparison  among  the  frequency 
types.  The  stabilization  of  changes  in  the  hierarchy  occurs  bottom-up. 

Figure 7: Variation of the fraction of crossover types over generations (horizontal axis: generation
number), while looking for a solution to EVEN-5-PARITY. Random indicates the propagation of
random individuals from the initial population due to reproduction.


The  results  of  this  section  support  the  hypothesis  that  HGP  discovers  and  exploits 
useful  structures  in  a  bottom-up,  hierarchical  manner.  AR-GP  has  an  explicit  policy  for 
a  bottom-up  exploitation  of  discovered  structures  for  making  search  more  efficient,  while 
ADF-GP  neglects  it. 

5  Discussion  of  Results 


The main difficulty of solving complex parity problems in GP is that the computational
effort increases exponentially with the size of the problem. Each fitness evaluation becomes
more expensive as the problem is scaled up. For instance, the number of fitness cases in
the EVEN-6-PARITY problem is twice that of the EVEN-5-PARITY problem and it doubles
with every unit increase in the order of the problem. A co-evolutionary approach such
as in [Hillis, 1990] could reduce the cost of a fitness evaluation by relying on a subset
of fitness cases which evolves dynamically by being controlled in its turn with a genetic
algorithm. Also, the number of fitness evaluations necessary to find a solution with high
probability increases with problem size [Koza, 1994b] in the standard GP implementation.
One explanation of the poor GP convergence is the inability of standard GP to exploit
opportunities for code generalization and reuse. In contrast, by using ADFs or adapting
the representation as in AR-GP the same problems can be solved more easily ([Koza, 1994b],
[Rosca and Ballard, 1994c]). We gave a qualitative explanation of the improved behavior
of HGP, based on an analysis of the evolution process on two dimensions: diversity and
causality. Next we relate these ideas to the tradeoff between exploration and exploitation.

This paper shows that there exists an implicit bias in the random generation of GP
solution encodings which confines population diversity. Diversity increases as a result of
changes in representation. The expanded structural complexity of HGP individuals also
increases diversity and highlights the distinctive size and shape of trees as compared to
standard GP. Increased diversity is related to an increased exploration of the space of
programs.

Figure 8: (a) Distribution trend of birth certificates. (b) Overlapping distributions show the
importance of the different crossover types when looking for a solution to EVEN-5-PARITY that was found
in generation 15.

The  results  presented  confirm  that  as  the  population  evolves,  increasingly  causal  changes 
become  more  important  and  are  selected.  Low  level  crossover  changes  in  the  function 
hierarchy  are  highly  non-causal  and  have  an  exploratory  role.  Exploitative  changes  are 
adopted  later  in  the  process  as  the  average  program  fitness  increases. 

The  stabilization  of  changes  in  the  function  hierarchy  occurs  bottom-up.  The  GP  search 
process  exploits  the  structures  already  discovered  although  it  does  not  avoid  spending  un¬ 
necessary  effort  with  state  space  exploration.  Useful  genetic  operations  are  also  the  more 
causal  ones.  Thus,  causality  is  correlated  with  search  space  exploitation. 

In most learning approaches the system must have an explicit policy of balancing
exploration and exploitation. In contrast, GP is a search technique that implicitly balances
exploration and exploitation, as I argue next.

Discovery  and  evolution  of  functions  amplifies  the  exploration  ability  of  the  GP  search 
process.  However,  as  the  best-of-generation  program  fitness  increases,  the  probability  of 
falling  upon  good  individuals  by  exploration  decreases  substantially.  The  GP  search  process 
exploits  the  structures  already  discovered,  although  it  does  not  avoid  spending  unnecessary 
effort  with  state  space  exploration.  Increased  exploitation  corresponds  to  more  and  more 
causal  effects  of  the  genetic  operations  (see  table  3).  This  suggests  that  GP  is  able  to 
dynamically  balance  exploration  and  exploitation. 

Another idea suggested by the causality analysis is that the tradeoff can also be
adaptively controlled by modifying the rate at which causal or non-causal genetic operations
are applied so that GP spends its search effort in a more efficient way. In the following we
discuss recent improvements of HGP algorithms from this perspective.


Table 3: Correlation between causality and exploratory ability in GP search.

Time (generation)      Early          Advanced
Crossover changes      Non-causal     Causal
Exploration            High           Low
Exploitation           Low            High


An interesting extension of ADF-GP confirms the importance of the idea of causality in
GP. [Koza, 1994a] introduces six new genetic operations for altering the architecture of an
individual program: branch duplication, argument duplication, branch deletion, argument
deletion, branch creation and argument creation. All addition operations respect the
principle of strong causality discussed before. They are performed such that they preserve the
behavior of the resulting programs. They merely increase the potential for program
refinement and thus they resemble the process of gene duplication in natural evolution [Koza,
1994a]. The duplication of elements of program architecture (branches or arguments) is
done in conjunction with a random replacement of the invocations of the corresponding
element to the duplicated copy. Such an operation decreases the probability that a future
random change will drastically change the behavior of the program. It respects the
principle of strong causality while allowing for future behavior improving changes. An analogous
conclusion can be drawn for the creation operations. The deletion operations do not
possess the nice properties mentioned above. They have the antagonistic role of confining the
increase in size of the evolved programs.
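
The behavior-preserving nature of branch duplication can be illustrated with a toy program representation. This is a minimal sketch under assumptions, not Koza's implementation: programs are nested tuples, only calls in the main body are redirected, each call is redirected with probability 1/2, and all names (`duplicate_branch`, `ADF0_copy`) are hypothetical.

```python
import random
from itertools import product

PRIMITIVES = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR": lambda a, b: not (a or b),
}


def evaluate(expr, env, adfs):
    """Evaluate a nested-tuple expression against variables and ADF definitions."""
    if isinstance(expr, str):
        return env[expr]
    name, *args = expr
    values = [evaluate(arg, env, adfs) for arg in args]
    if name in PRIMITIVES:
        return PRIMITIVES[name](*values)
    params, body = adfs[name]
    return evaluate(body, dict(zip(params, values)), adfs)


def duplicate_branch(body, adfs, name, copy_name, rng):
    """Copy ADF `name` to `copy_name` and randomly redirect calls in the body."""
    new_adfs = dict(adfs)
    new_adfs[copy_name] = adfs[name]  # identical definition, so behavior is kept

    def redirect(expr):
        if isinstance(expr, str):
            return expr
        head, *args = expr
        if head == name and rng.random() < 0.5:
            head = copy_name
        return (head, *(redirect(arg) for arg in args))

    return redirect(body), new_adfs


def behavior(program_body, program_adfs):
    """Outputs of the program on all assignments of d0, d1, d2."""
    return [
        evaluate(program_body, dict(zip(("d0", "d1", "d2"), bits)), program_adfs)
        for bits in product([False, True], repeat=3)
    ]


adfs = {"ADF0": (("x", "y"), ("NAND", "x", "y"))}
body = ("OR", ("ADF0", "d0", "d1"), ("ADF0", "d1", "d2"))
new_body, new_adfs = duplicate_branch(body, adfs, "ADF0", "ADF0_copy", random.Random(1))
same_behavior = behavior(body, adfs) == behavior(new_body, new_adfs)
print(new_body, same_behavior)
```

Whatever calls are redirected, the program computes the same function immediately after the operation. A later change to `ADF0_copy` affects only the redirected calls, which is why the duplication limits the effect of future random changes.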


6  Conclusions 

This report presented a unifying view of the two approaches to the discovery of functions,
ADF-GP and AR-GP, emphasizing the hierarchical structure of the resulting problem
representation. It showed that the exploration of the search space in GP depends on the power
of the discovery and evolution of functions.

The report also analyzed the causality of the crossover operator in GP and suggested
that search control parameters can be adapted for speeding up GP search. Standard GP
presents a characteristic instability or poor causality of the structures evolved, which can be
varied by changing the probability of selecting crossover points. This effect is amplified in
GP with function discovery. Poor causality has been discussed in relation to the
exploration-exploitation tradeoff in search problems.

Arguments  for  a  bottom-up  evolutionary  thesis  of  GP  were  discussed.  Early  stages  of 
evolution  in  GP  usually  discover  stable  components  [Simon,  1973].  Replaying  backwards 
the  genealogy  tree  that  resulted  in  a  problem  solution  shows  that  most  changes  in  later 
stages of evolution are performed at the higher hierarchical levels. This suggests that in an
unconstrained HGP approach such as ADF-GP there are implicit constraints. Early in the
process the changes are focused on the evolution of more primitive functions. Later
in the process the changes are focused on the evolution of program control structures.

Discovery  of  functions  is  also  an  adaptive  mechanism  for  trading  off  exploration  and 
exploitation  in  GP.  Most  often  the  control  structure  of  a  search  algorithm  explicitly  balances 
exploration  and  exploitation  by  means  of  control  parameters.  In  contrast,  GP  is  a  search 
technique  that  implicitly  balances  exploration  and  exploitation. 

The  emergent  structure  in  ADF-GP  as  an  effect  of  causality  is  an  explicit  policy  in  AR- 
GP.  The  bottom-up  evolution  of  HGP  discussed  in  the  paper  justifies  this  explicit  search 
for  building  blocks  and  the  expansion  of  the  problem  representation,  which  was  successfully 
used in AR-GP [Rosca and Ballard, 1994a]. AR-GP uses fit, small blocks to define new
functions.  AR-GP  evolves  a  variable  hierarchy  of  functions,  each  having  a  variable  number 
of  arguments.  A  process  of  extinction  of  population  individuals  accelerates  the  use  of  the 
discovered  functions. 

Future  work  aims  at  an  improved  version  of  AR-GP  that  will  additionally  allow  for  evo¬ 
lution  of  functions.  One  could  use  the  insights  about  the  dynamics  of  GP  search  presented 
in this paper to design a more refined and efficient GP-like system that involves
automatic  adaptation  of  control  parameters. 


Appendix: Expanded structural complexity

We evaluate the expanded structural complexity of function $F_i$, $IC(F_i)$, after we have
evaluated $IC(F_j)$ for all $0 \le j < i$, by performing inline substitutions of all functions called
by $F_i$ with their expanded bodies.
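
The inline substitution can be sketched directly by building the expanded tree and counting its nodes. This is a minimal sketch rather than the report's implementation: the tuple encoding and the helper names `substitute`, `expand`, `size` and `inline_complexity` are assumptions, and the token `adf0` in the definition of F2 in Table 4 is read here as a call to F0.

```python
# Trees are nested tuples (op, child, ...); leaves are strings (hypothetical encoding).
# DEFS maps each discovered function to (formal parameters, body), following Table 4.
DEFS = {
    "F0": (("a0", "a1"),
           ("or", ("and", "a0", "a1"),
                  ("and", ("nand", "a0", "a0"), ("nand", "a1", "a1")))),
    "F1": (("a0", "a1"),
           ("nand",
            ("or", ("and", "a0", "a1"),
                   ("F0", ("nand", "a0", "a0"), ("nand", "a1", "a1"))),
            ("or", ("and", "a0", "a1"),
                   ("and", ("nand", "a0", "a0"), ("nand", "a1", "a1"))))),
    # Assumption: "adf0" in the source table denotes F0.
    "F2": (("a0", "a1"),
           ("F1", ("F0", ("F0", "d0", "d1"), ("F0", "d2", "d3")), "d4")),
}

def substitute(tree, env):
    """Simultaneously replace formal-parameter leaves by the trees bound in env."""
    if isinstance(tree, str):
        return env.get(tree, tree)
    return (tree[0],) + tuple(substitute(child, env) for child in tree[1:])

def expand(tree):
    """Inline every call to a discovered function with its expanded body."""
    if isinstance(tree, str):
        return tree
    op = tree[0]
    args = [expand(child) for child in tree[1:]]
    if op in DEFS:
        params, body = DEFS[op]
        return substitute(expand(body), dict(zip(params, args)))
    return (op,) + tuple(args)

def size(tree):
    """Number of nodes (function nodes and leaves) in a tree."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(size(child) for child in tree[1:])

def inline_complexity(name):
    """Node count of the fully expanded body of a discovered function."""
    return size(expand(DEFS[name][1]))
```

Under these assumptions, `inline_complexity` returns 11, 39 and 739 for F0, F1 and F2, which agrees with the In-line Expanded (IC) row of Table 4. Building the expanded tree is only practical for small hierarchies, since its size can grow exponentially with the depth of the hierarchy.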

From a computational point of view, we also have to keep track of the number of times
each argument of a function appears in the expanded version of the function, i.e. after the
inline substitution of each of the lower order functions has been performed. For example,
if $F_1$ has a call to $F_0$, the numbers of times each of $F_1$'s arguments appears before and after
the inline substitution of $F_0$ will differ in general. This influences the expanded structural
complexity of a function (say $F_2$) that calls $F_1$. The general case is considered below.

If $F_i$ has $n_i = j$ arguments, then we represent the number of times each of $F_i$'s arguments
appears in $F_i$'s expression by a vector $x^i = [x^i_1 \; x^i_2 \; \ldots \; x^i_j]^t$, where the $t$ superscript is the
vector transposition operator. In the formulae below, $T$ is an arbitrary subtree of $F_i$. The root
label of $T$, $root(T)$, can be a function from the initial function set $\mathcal{F}_0$, a newly discovered
function, a variable of $F_i$, or a leaf which is not a variable. When $root(T)$ is a function of
arity $k$, $T_1, \ldots, T_k$ are the ordered subtrees of $T$.

$IC(F_i)$ and $x^i$ are computed in a bottom-up manner, starting with $i = 0$. For each
$i$, $IC(F_i)$ and $x^i = x^i(F_i)$ are defined recursively, starting with the tree representing $F_i$,
$T = F_i$, in the following way:


$$
IC(T) =
\begin{cases}
1 & \text{if } T \text{ is a leaf} \\
IC(root(T)) + \sum_{l=1}^{k} IC(T_l) * x_l & \text{if } root(T) \text{ is a function}
\end{cases}
$$


$IC(f)$ has been computed previously if $f = root(T)$ is a discovered function ($f \in
\mathcal{F} - \mathcal{F}_0$), or $IC(f) = 1$ for a primitive function ($f \in \mathcal{F}_0$).


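$$
x^i(T) =
\begin{cases}
0 & \text{if } T \text{ is a non-argument leaf} \\
[0 \ldots 0 \; 1 \; 0 \ldots 0]^t & \text{if } T \text{ is the } l\text{-th argument of } F_i \text{ (a 1 on the } l\text{-th position)} \\
\sum_{l=1}^{n_m} x^m_l \cdot x^i(T_l) & \text{if } root(T) = F_m,\ m < i, \text{ and } n_m \text{ is the number of arguments of } F_m
\end{cases}
$$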


$x^i(T_l)$ represents the number of appearances of the arguments of $F_i$ in the expanded
subtree $T_l$, and ($\cdot$) is the scalar-vector product. An example is presented in Table 4.


Table 4: Example of complexity values for a hierarchy of three Boolean functions F0, F1 and F2.

F0  (defun F0 (a0 a1)
      (or (and a0 a1) (and (nand a0 a0) (nand a1 a1))))
F1  (defun F1 (a0 a1)
      (nand (or (and a0 a1) (F0 (nand a0 a0) (nand a1 a1)))
            (or (and a0 a1) (and (nand a0 a0) (nand a1 a1)))))
F2  (defun F2 (a0 a1) (F1 (adf0 (F0 d0 d1) (F0 d2 d3)) d4))

Complexity               F0    F1    F2
Structural (SC)          11    23    43
Executional (EC)         11    34    76
In-line Expanded (IC)    11    39   739
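
The bottom-up bookkeeping can be sketched without materializing the expanded trees. This is a minimal sketch under stated assumptions, not the report's implementation. The tuple encoding and the names `analyze` and `complexity_table` are hypothetical, and `adf0` in F2 is read as a call to F0. When a discovered function is called, the sketch counts the callee's non-argument nodes once (its expanded size minus its argument occurrences) and each actual argument subtree once per occurrence of the matching formal argument. That adjustment is an assumption of the sketch, chosen because it reproduces the IC row of the table.

```python
# Trees are nested tuples (op, child, ...); leaves are strings (hypothetical encoding).
# Functions are listed in hierarchy order: each may call only earlier ones.
DEFS = [
    ("F0", ("a0", "a1"),
     ("or", ("and", "a0", "a1"),
            ("and", ("nand", "a0", "a0"), ("nand", "a1", "a1")))),
    ("F1", ("a0", "a1"),
     ("nand",
      ("or", ("and", "a0", "a1"),
             ("F0", ("nand", "a0", "a0"), ("nand", "a1", "a1"))),
      ("or", ("and", "a0", "a1"),
             ("and", ("nand", "a0", "a0"), ("nand", "a1", "a1"))))),
    # Assumption: "adf0" in the source table denotes F0.
    ("F2", ("a0", "a1"),
     ("F1", ("F0", ("F0", "d0", "d1"), ("F0", "d2", "d3")), "d4")),
]

def analyze(tree, params, known):
    """Return (expanded size of tree, occurrence counts of params in its expansion)."""
    if isinstance(tree, str):
        # A leaf counts as one node; it contributes a unit vector if it is an argument.
        return 1, [1 if tree == p else 0 for p in params]
    results = [analyze(child, params, known) for child in tree[1:]]
    if tree[0] in known:
        ic_f, x_f = known[tree[0]]
        own = ic_f - sum(x_f)      # callee nodes that are not argument leaves
        weights = x_f              # the l-th actual argument is copied x_f[l] times
    else:
        own, weights = 1, [1] * len(results)   # primitive: one node, each argument once
    total = own + sum(w * s for w, (s, _) in zip(weights, results))
    vec = [sum(w * v[i] for w, (_, v) in zip(weights, results))
           for i in range(len(params))]
    return total, vec

def complexity_table(defs):
    """Process functions bottom-up; map each name to (IC, occurrence vector)."""
    known = {}
    for name, params, body in defs:
        known[name] = analyze(body, params, known)
    return known

table = complexity_table(DEFS)
```

Here `table["F1"]` records that each argument of F1 appears 10 times in its expanded body, which is why the single call to F1 in F2 multiplies the size of its first argument subtree tenfold.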